ABSTRACT


Sharing sensitive data is a specific challenge for research infrastructures in the field of life sciences. For 
that reason a toolbox has been developed, providing resources for researchers who wish to share and use 
sensitive data, to support the workflows for handling these kinds of digital objects. Common and community 
approved annotations are required to be compliant with FAIR principles (Findability, Accessibility, 
Interoperability, Reusability). The toolbox makes use of a tagging (categorisation) system, allowing consistent 
labelling and categorisation of digital objects, in terms relevant to data sharing tasks and activities. A pilot
study was performed within the Horizon 2020 project EOSC-Life, in which two experts from each of six life science
research infrastructures were recruited to independently assign tags to the same set of 10 to 25 resources
related to sensitive data management and data sharing (110 resources in total). Summary statistics of agreement and
observer variation per research infrastructure are provided. The pilot study showed that experts were able
to assign tags, but in most cases with considerable variation between the two experts. In the context of
CWFR (Canonical Workflow Frameworks for Research), this indicates the necessity for careful definition, 
evaluation and validation of parameters and processes related to workflow descriptions. The results from 
this pilot study were used to tackle this issue by revising the categorisation system and providing an 
updated version. 


1. INTRODUCTION 


The Horizon 2020 cluster project EOSC-Life brings together the 13 Life Science ‘ESFRI’ (European 
Strategy Forum on Research Infrastructures) research infrastructures (RIs) to create an open, digital and
collaborative space for biological and medical research (https://www.eosc-life.eu/). Sharing sensitive data 
is a specific challenge within EOSC-Life. For that reason a toolbox has been developed, providing resources 
for researchers who wish to share and use sensitive data, to support the workflows for handling these kinds 
of digital objects (1). The sensitivity of the data normally stems from it being personal data, but can also 
be caused by intellectual property considerations, biohazard concerns, or because it falls within rules or 
protocols such as the Nagoya protocol (2). The toolbox does not create new content. Instead, it allows 
researchers to find existing resources that are relevant for sharing sensitive data across all participating 
research infrastructures (F in the FAIR Guiding Principles for scientific data management and stewardship, 
providing guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets 
(3)). The toolbox provides links to recommendations, procedures, and best practices, as well as to software 
(e.g., tools and scripts supporting workflow) to support data sharing and reuse. It makes use of a tagging 
system, allowing consistent labelling and categorisation of digital objects, in terms most relevant to data 
sharing tasks and activities. The first version of the categorisation system was developed and its description 
published before executing the study (4). Components of the system were based upon existing classification 
systems (5-11). As a result of this pilot study, the categorisation system has been improved and simplified. 


The work is relevant i) for developing ideas about tagging workflows within CWFR (Canonical Workflow 
Frameworks for Research) and ii) to the characterisation and use of FAIR digital objects (12). Handling of
sensitive data and digital objects is a major issue and a challenging topic for many research infrastructures 
in the life sciences. The workflows around sensitive data are extremely complex and highly dependent on 
the regulatory environment. The toolbox is targeted at the discovery of resources dealing with sensitive data 
through a tagging system, which characterises the resources using cross-community approved categories. 
One of these categories provides the stage of digital objects in the data sharing life cycle. The toolbox also 
characterises, in broad terms, the data type investigated, contributing to the description of data collections 
as required in the context of CWFR. Because the work is spanning all life-science infrastructures, it 
also offers valuable input into the discussion of the applicability of generic workflow fragments across 
infrastructures and an inter-RI approved tagging workflow system. Thus, the toolbox and the underlying
categorisation system will contribute to the further development of CWFR in the area of digital objects with
sensitive data. 


2. RELATED WORK 


Various resources have been developed to support the management of sensitive data. These include codes 
of conduct, guidelines, recommendations, policies, and descriptions of best practice, but also computer 
tools and services (e.g., for de-identification of data). For identifying these resources, various databases, 
catalogues and repositories are available and can be used, often with their own internal tagging system. 
For specific areas dedicated portals have been developed and implemented, such as the ELSI (Ethical, Legal 
and Social Implications) Knowledge Base from BBMRI (Biobanking and Biomolecular Resources Research 
Infrastructure) (https://www.bbmri-eric.eu/elsi/knowledge-base/) or the RDA (Research Data Alliance) 
COVID-19 Recommendations and Guidelines for Data Sharing (https://www.rd-alliance.org/group/rda-covid19-rda-covid19-omics-rda-covid-19-epidemiology-rda-covid19-clinical-rda-covid19-0). The latter
incorporates a tagging system developed to characterise documents, allowing better support for searching 
and filtering. What is still missing is a system providing guided access to resources dealing with sensitive 
data in general and spanning the full area of life sciences. 


3. PROBLEM FORMULATION 


A large part of the compliance with FAIR principles, and the creation of minimal metadata sets required 
for FAIR data, are driven by community level processes (17, 18). Any toolbox being developed should be 
based on common and community approved metadata, and it should refer to standardised workflows and 
an integrated approach to arrive at community approval for FDO (FAIR Digital Object) and CWFR concept requirements. The
categorisation system under study is a crucial part of these common, appropriate and sufficient metadata
(19). The objective of the pilot study was to have human experts from different life science RIs evaluate the
first version of the categorisation system with respect to usefulness, reliability and consistency. Based on the
results of the study, an improved version of the categorisation system was produced; it will be used as a
central component within the toolbox to support rapid and intuitive retrieval of resources by future users.
Additionally, the study was intended to provide insights into handling discrepancies in developing and 
applying metadata, through a defined testing and community approval process. 


4. APPROACH AND METHODOLOGY 


The pilot study was performed within the EOSC-Life Work Package (WP) dedicated to sensitive data, 
supported by the partners of this WP and coordinated by the European Clinical Research Infrastructure 
Network (ECRIN) (https://ecrin.org/). The study protocol was registered before the study started on 8 
December 2020 (16). The pilot study was organised using a strict methodology. Each involved infrastructure 
nominated two experts, willing to perform the individual assessment of a number of typical resources 
around sensitive data to be included in the EOSC-Life toolbox. The experts from the involved infrastructures
selected a given number of resources: 25 for each of ECRIN, BBMRI, and EATRIS (European Advanced
Translational Research Infrastructure in Medicine), and at least 10 for each of EMBRC (European Marine
Biological Resource Centre), ERINHA (European Research Infrastructure on Highly Pathogenic Agents), and
Euro-BioImaging. In all cases they spanned a wide range of resource types (e.g. legislation & regulations,
position papers, policies & principles, background & explanatory material, tools). The experts selected the 
resource types that were relevant for their research infrastructure and then assessed these resources using 
the categorisation system independently of each other. 


As a technologically agnostic first step, the evaluation was supported through the bibliographic tool 
Zotero Tags (https://www.zotero.org/). The categorisation system used in the pilot study consisted of eight 
dimensions: resource type, research field, research design, data type, stage in data sharing life cycle, 
geographical scope, specific topics or keywords and targeted group (see the short form of categorisation 
system in supporting material S1 and (4)). It was a requirement that all dimensions of the categorisation 
system were applied to each resource. Multiple tags per dimension were possible. 
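
The rule that every dimension must be applied to each resource, with several tags allowed per dimension, can be illustrated with a minimal Python sketch. The dimension identifiers and all tag values below are hypothetical and chosen only for illustration; they are not the tag list of the categorisation system, and the pilot study itself recorded tags in Zotero rather than in code.

```python
# The eight dimensions named in the text; the identifiers are illustrative.
DIMENSIONS = (
    "resource_type",
    "research_field",
    "research_design",
    "data_type",
    "data_sharing_stage",
    "geographical_scope",
    "specific_topics",
    "targeted_group",
)


def missing_dimensions(assignment):
    """Return the dimensions for which no tag was assigned.

    `assignment` maps a dimension name to a set of tags. An empty result
    means that all dimensions were applied to the resource.
    """
    return [d for d in DIMENSIONS if not assignment.get(d)]


# Hypothetical example: the tag values are made up for illustration only.
example = {
    "resource_type": {"guideline"},
    "research_field": {"clinical research", "genomics"},  # multiple tags allowed
    "research_design": {"not applicable"},
    "data_type": {"clinical data"},
    "data_sharing_stage": {"data sharing"},
    "geographical_scope": {"europe"},
    "specific_topics": {"informed consent"},
}
print(missing_dimensions(example))  # ['targeted_group']
```

A check of this kind only enforces completeness; it says nothing about whether two experts would choose the same tags.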


Summary statistics of agreements/disagreements between the two experts of each research infrastructure 
were generated. Separately for each research infrastructure and per category, cases where both experts 
agreed to assign a tag to a resource (yes-yes) and cases where one expert used a tag and the other did not
(yes-no or no-yes) were counted. The agreement rate was calculated as the percentage of “yes-yes” cases among all
tags that were used at least once. To measure interrater reliability between the assessments of the experts of
the same infrastructure, the kappa coefficient developed by Fleiss was applied (17). The statistical analysis
was performed using the statistical software R version 4.0.2. 
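
The two measures described above can be sketched in a few lines of Python. This is not the authors' R code. It assumes that, for the kappa of a single resource, every available tag is treated as one subject that each of the two experts rates as "yes" (assigned) or "no" (not assigned); the text does not state how the rating table was built, so this construction is an assumption. The function names and the example tags are hypothetical.

```python
def agreement_rate(tags_a, tags_b):
    """Percentage of "yes-yes" among all tags used by at least one expert."""
    used = tags_a | tags_b
    if not used:
        return None  # no tag used at all: the rate is undefined
    return 100.0 * len(tags_a & tags_b) / len(used)


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-subject category counts.

    `counts[i][j]` is the number of raters who put subject i in category j;
    every row must sum to the same number of raters n (n >= 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Mean observed agreement over subjects.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_subjects * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return None  # all ratings fall in one category: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


def kappa_for_resource(all_tags, tags_a, tags_b):
    """Assumption: each available tag is a subject rated yes/no by two experts."""
    counts = []
    for tag in all_tags:
        yes = (tag in tags_a) + (tag in tags_b)
        counts.append([yes, 2 - yes])
    return fleiss_kappa(counts)


# Hypothetical resource with six available tags.
all_tags = ["t1", "t2", "t3", "t4", "t5", "t6"]
expert_a = {"t1", "t2", "t3"}
expert_b = {"t2", "t3", "t4"}
print(agreement_rate(expert_a, expert_b))                          # 50.0
print(round(kappa_for_resource(all_tags, expert_a, expert_b), 2))  # 0.33
```

In this toy case two of the four tags in use are shared (50% agreement), while kappa is lower because agreement on unused tags is partly expected by chance.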


5. EXPERIMENTS AND ANALYSIS 


The agreement rates between experts per infrastructure and category are presented in the supporting 
material (see supplementary material S2: Final study report). The rate was very high for BBMRI with 70.2%. 
For the remaining research infrastructures, the agreement rates were considerably lower with a range 
between 33.2% (ECRIN) and 20.2% (EMBRC). The agreement rates differed between the categories. Rates 
higher than 40% were achieved for category 2—research field (49.3%), category 6—geographical scope 
(44.5%) and category 1—resource type (41.8%). The lowest rates were measured for category 3—research 
design (18.3%) and category 5—stage in data sharing life cycle (25.1%). It is important to note here that 
some experts assigned many more tags from specific categories than other experts, contributing significantly 
to the disagreement. 


The findings were confirmed by the interobserver reliability analysis (Table 1). A high kappa value was 
found for BBMRI (median kappa for 25 resources: 0.84). The kappa values were much lower for the other 
research infrastructures with a range between 0.44 for ECRIN and 0.22 for EMBRC (median kappa value). 
For BBMRI only one assessment was lower than 0.5 and only 9 assessments lower than 0.8. For ECRIN,
assessment of 6 out of 25 resources resulted in a kappa >= 0.50. This was the case only for 2 out of 12
resources for EMBRC, for 2 out of 25 for EATRIS and for 1 out of 10 for ERINHA. All the other assessments
resulted in kappa values less than 0.50. The results are summarised in Table 1.


Table 1. Interobserver reliability (kappa values).

            Research infrastructure
Resource*   BBMRI   EATRIS   ECRIN   EMBRC   ERINHA   Euro-BioImaging
1           0.79    0.47     0.39    0.09    0.33     0.18
2           0.81    0.32     0.31    0.11    0.29     0.20
3           0.84    0.36     0.13    0.24    0.55     0.33
4           1.00    0.24     0.73    0.08    0.12     0.33
5           0.54    0.21     0.26    0.22    0.03     0.24
6           0.10    0.36     0.34    0.16    0.42     0.28
7           0.86    0.18     0.47    0.55    0.28     0.30
8           0.64    0.42     0.45    0.42    0.41     0.26
9           0.75    0.14     0.20    0.51    0.12     0.36
10          0.94    0.20     0.47    0.22    0.31     0.33
11          0.81    0.44     0.79    -0.04   -        0.38
12          0.78    0.45     0.45    0.33    -        0.26
13          0.86    0.32     0.44    -       -        0.30
14          1.00    0.20     0.52    -       -        -
15          0.86    0.24     0.24    -       -        -
16          0.95    0.24     0.55    -       -        -
17          0.64    0.62     0.44    -       -        -
18          1.00    0.20     0.22    -       -        -
19          0.84    0.33     0.28    -       -        -
20          0.90    0.20     0.36    -       -        -
21          0.64    0.12     0.50    -       -        -
22          0.81    0.36     0.33    -       -        -
23          0.86    0.33     0.60    -       -        -
24          0.84    0.24     0.64    -       -        -
25          0.88    0.55     0.16    -       -        -
Median      0.84    0.32     0.44    0.22    0.30     0.30


Note: *Digital objects assessed were different between the research infrastructures. Kappa < 0: poor agreement,
0.0-0.20 slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement,
and 0.81-1.0 almost perfect agreement (>= 0.5: bold).
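
As an illustration of how the agreement bands in the note to Table 1 apply to the reported values, the following Python sketch maps a kappa value to its band and recomputes one median. The function name is hypothetical, the sketch assumes kappa values rounded to two decimals (as in the table), and the EMBRC values are copied from Table 1.

```python
from statistics import median


def agreement_label(kappa):
    """Map a kappa value to the agreement band given in the note to Table 1."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values for the 12 EMBRC resources, copied from Table 1.
embrc = [0.09, 0.11, 0.24, 0.08, 0.22, 0.16, 0.55, 0.42, 0.51, 0.22, -0.04, 0.33]

embrc_median = median(embrc)
print(round(embrc_median, 2), agreement_label(embrc_median))  # 0.22 fair
print(agreement_label(0.84))  # almost perfect (the BBMRI median)
```

Read this way, the median for BBMRI falls in the "almost perfect" band, whereas the medians of the other five infrastructures fall in the "fair" or "moderate" bands.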


Several experts suggested new tags. This covered additional “resource types” that were not listed (e.g., 
webinar, template, Q&A), new “research fields” that were not included (e.g., law, ethics, sociology, genomic 
research), the answer “any” to “research design” and “data type”, more generic terms additional to specific 
“stages of data sharing” (e.g., data transfer, data sharing, secondary processing of data), additional “specific 
topics” (e.g., code of conduct, ELSI, COVID-19, material transfer agreement) and additional “perspectives”
(e.g., participant, resource provider).


The results from this pilot study were used to revise the categorisation system, generating an improved 
and simplified updated version 2 (Figure 1 and supplementary material S3: Short form of the categorisation
system (Version 2)). 


Figure 1. Mindmap of short version of categorisation system (version 2) (See also: Supplementary material S3 for 
presentation as a table). 


6. DISCUSSION 


BBMRI had already performed major work related to the handling of sensitive data in the field of health 
research, so the high interobserver correspondence in the pilot study is not surprising. The BBMRI ELSI 
Knowledge Base had already been structured using resource categories that included topic, areas of interest 
and geographical scope. The resources assessed in the pilot study had been selected from this knowledge 
base and existing tags were used in the assessment. The resources selected by BBMRI are closely related 
to the BBMRI work and concentrate on ELSI aspects, covering biobanks, de-identification, ethical handling, 
GDPR, consultation, transfer agreements, informed consent, etc. In contrast, some of the research 
infrastructures deliberately tried to select a wide variety of resources, to create a more demanding test of
the categorisation system (e.g., EATRIS).


With respect to the diverging results of the remaining five research infrastructures, the following
mechanisms were identified as possible ways of improving the categorisation system, to achieve the goal
of a canonical workflow for cross-disciplinary tagging in the life sciences:


Training: The categorisation was discussed during several telephone conferences by the group and 
most experts involved in the assessment participated in the development of the system. Nevertheless, 
there was no systematic training on the application of the system before the pilot study. 

Standard definitions: In the development of the categorisation system, existing terminologies and 
classifications were taken into consideration (5-11). However, standardised definitions and a glossary 
were not provided. This was expressed as a deficit by several of the experts. The first study helped to
show where most misunderstandings and polysemic biases arise.

Elimination of ambiguity: For some of the categories it was not totally clear how they should be 
applied to the resources. For example, category 8 (perspective) could be considered from the 
developer's perspective (the person who developed the resource) or from the user's perspective (the 
person at whom the resource is targeted). Similarly with category 6, a geographical scope could refer 
to the developers (e.g., produced in one country) or to the target users (e.g., global). 

Guiding the number of tags assigned for a category: The number of tags assigned to resources for a 
specific category was not uniform between experts. Ways of reducing these discrepancies (e.g. by 
putting a maximum on tag numbers) should be explored. 

Filling gaps: For several categories, important tags were seen as missing from the pre-specified list, 
for example, missing resource types (category 1), research fields (category 2) and specific topics 
(category 7). These gaps were considered in detail during the creation of version 2 of the tagging
categories. 

Elimination of meaningless tags: For a considerable number of resources and most of the infrastructures 
“not applicable”, “not specified” or “not clear” was assigned to category 3 (research design). It seems
that specific study types are only relevant for some research infrastructures (e.g., ECRIN, EATRIS,
EMBRC). 

Simplifying research stages: It proved difficult to allocate specific stages in the data sharing life cycle 
(category 5) to a number of resources despite using a widely used classification of stages in the data 
sharing life cycle. As a consequence, more generic tags were suggested that cover broader phases. 
This finding is relevant for the discussion on CWFR.

Simplifying data type tags: For category 4 (data type), specific combinations of data types were 
selected for a considerable number of resources, indicating that more generic terms could improve 
applicability of the categorisation system. It may be worth exploring the idea of a hierarchy of tags, 
because in some cases both a generic and a specific term could apply. In addition, some specific 
data types are only relevant for some research infrastructures, in particular those dealing with human 
data. Again, this point is of relevance for CWFR. 


Based on the analysis and comments of the experts, the categorisation system was recalibrated and
simplified. Suggestions for additional tags were taken into consideration, but there was also a general 
request to keep things as simple as possible. In order to reduce the considerable interobserver misalignment 
between the experts, the definitions of the tags were clarified and the distinction between tags was improved. 
The value of the original 8 tagging dimensions was re-evaluated and it was concluded that two could be 
dropped. Where possible, tags were combined to keep the number of available options as low as possible.
The result is a categorisation system with 55 tagging values spread over 6 dimensions, as compared to the 
original 93 tags spread over 8 dimensions (Figure 1 and supplementary material S3: Short form of the 
categorisation system (version 2)). Nevertheless, discussion and further evaluation of the categorisation 
system is still ongoing, and improvements are under discussion for future versions. 


Relevance of the pilot study for CWFR and FDO (FAIR Digital Objects) typology 


• Data annotation workflows constitute a basic element for improving reuse (R1.3 of Wilkinson criteria)
and a first step in building rich community approved metadata (F2 and F4) and FAIR vocabularies
linked to other metadata (I2 and I3; (3)).

• The pilot study is relevant to proposals for cross-disciplinary methods in life sciences to define FDO
typology that could be reused by the new FDO Semantics WG (FDO-SEM | FAIR Digital Objects 
Forum (fairdo.org)). These typologies will be necessary to evaluate what should be incorporated within 
FAIR metadata (F2 and R1). 

• CWFR includes a discussion of the role of generic workflows and whether modelling across disciplines
is possible (18). Developing the categorisation system for the toolbox has demonstrated an approach
to tackling this problem in the context of the life-sciences infrastructures. This tagging system may
be considered as a first level of interoperability for vocabulary typology (I2). The pilot study has
demonstrated, however, considerable differences between research infrastructures, which have to be
taken into consideration in the discussion on future generic workflows.

• Harmonisation around how data collections are being described is required in the context of CWFR
(18). On a high level and across all life-sciences infrastructures, this issue has been investigated in 
the categorisation system by incorporating the dimension “data type”, covering various specific data 
types about or from living humans. Again, issues were observed that indicated the necessity for clear 
definitions of parameters and clear rules on how to apply them. 

• As the CWFR initiative wants to discuss ways to reduce the large gap between technology and
principles on the one hand and data practices in the labs on the other hand, the toolbox and its 
categorisation system constitute a practical example of how to improve interoperability between 
a large panel of (life sciences) disciplines. This exercise is based on real resources / digital objects, 
but as with other FAIRification processes it needs successive iterations and further community 
approval (19). 

• The toolbox improves the discoverability (F in FAIR) of digital objects linked to sensitive data and
thus will strengthen the provision of FAIRer digital objects because systematic application of a
categorisation methodology will yield metadata enrichment (for F4 and R1.3 compliance). This is
a critical aspect for the handling of sensitive data in life sciences and to arrive at the necessary
standard solutions and workflows to comply with rules and regulations. The discussion around CWFR
and FDO should be informed that this type of data is likely to need specific workflows to handle it. 
The toolbox will become a major resource to support this work by referencing relevant resources (F4). 

• Consideration should also be given to the tagging workflow itself, and how it might be better
standardised. Elements of this tagging process are: Selection of resources to be tagged, assigning 
taggers to resources, performing the tagging (under standardised conditions), quality control and 
approval of the tagging, and monitoring and supervision of the process. The pilot study demonstrated 
major variation in tagging of resources if independent taggers are assessing the same resource (inter- 
observer variation). The example of BBMRI has shown that with adequate measures this inter-observer 
variation can be considerably reduced. Another source of variance may occur if a tagger assesses the 
same resource at different time points. In order to come to a valid and reliable tagging workflow, both 
inter- and intra-observer variation have to be taken into account and be reduced. For example, the 
workflow of the tagging process should include an independent assessment by a second reviewer 
and a final consensus by both experts. An intercalibration tool for future taggers (with a training set) 
could be provided based on the first set of digital objects. Recurrent training could help to make the 
tagging workflow more replicable, although the tagging system itself should be mostly self-explanatory, 
as end users of the toolbox have to use the same system for querying purposes. 

• In summary, the work is of major relevance for CWFR and FDO. An FDO-enabled application is able
to identify the type of object (through reference to the object’s type), directly operate on the object
(through the object's location reference) or get more information about the object (through reference
to the object's metadata records). A prerequisite is good, appropriate and sufficient metadata for an
FDO, in our case characterising digital objects related to managing sensitive data. The categorisation
system developed in the pilot study could evolve into a “methodological resources in life sciences”
FDO type and could thus be of major help by linking services to digital objects. So far, FDO types
are not well described in the life sciences and work has only been started in the FDO Forum.
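
The proposal above that a tagging workflow should include an independent assessment by a second reviewer followed by a final consensus can be sketched as follows. This is a minimal illustration in Python, not a tool used in the pilot study: the function name, the data layout (one set of tags per dimension) and the tag values are all hypothetical.

```python
def disagreements(first, second):
    """List per-dimension tag differences between two independent taggers.

    `first` and `second` map a dimension name to the set of tags assigned
    by each tagger. Only dimensions with at least one difference are returned,
    so the result can serve as the agenda for the consensus step.
    """
    result = {}
    for dimension in sorted(set(first) | set(second)):
        tags_1 = first.get(dimension, set())
        tags_2 = second.get(dimension, set())
        if tags_1 != tags_2:
            result[dimension] = {
                "only_first": sorted(tags_1 - tags_2),
                "only_second": sorted(tags_2 - tags_1),
            }
    return result


# Hypothetical assignments for one resource (tag values are illustrative).
tagger_1 = {"resource_type": {"guideline"}, "data_type": {"genomic", "clinical"}}
tagger_2 = {"resource_type": {"guideline"}, "data_type": {"clinical"}}

for dimension, diff in disagreements(tagger_1, tagger_2).items():
    print(dimension, diff)
# data_type {'only_first': ['genomic'], 'only_second': []}
```

Listing only the differing dimensions keeps the consensus discussion focused; the same comparison could be applied to one tagger at two time points to examine intra-observer variation.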


7. CONCLUSIONS AND FUTURE WORK 


The first thing we demonstrated was that, to obtain a relatively stable and generally applicable categorisation
system across life science RIs, an iterative process was necessary to reduce the variation between experts
from different RIs. As the categorisation system is part of an essential minimal metadata set in this interdisciplinary
study, this is of relevance for a “resources in life sciences” FDO and may also be significant for other FDO
types (in particular a more generic “FDO resources” type), such as FDOs describing workflows at a canonical
level, particularly in the CWFR. In this context an important way to promote FAIRness is to request that all
canonical components should support the concept of FDO (https://fairdo.org/wg/fdo-cwfr/). Secondly, the 
results from our study show that it will not be simple to define widely agreed data types across research 
domains and that pragmatic approaches are needed, which may carry the risk of a proliferation of data 
types. Thirdly, the pilot study demonstrated that highly sophisticated algorithms will be required to develop 
reliable automatic methods to interpret the categories relevant for sensitive data workflows (and indeed in 
the CWFR). It may be difficult, even impossible, to find purely machine-actionable algorithms for categorising
digital objects dealing with managing sensitive data. 


Avoiding polysemic bias is one of the more difficult issues to resolve and is necessary across disciplines.
The results from this pilot study have been used to tackle this issue by revising the categorisation system 
and providing an updated version 2 (Figure 1, supplementary material S3). This version contains definitions 
for each term of the categorisation system, though these need to be further evaluated and approved by all 
life-sciences infrastructures. Currently, the 110 resources from the pilot study are being re-tagged with the 
new version of the categorisation system. This initial set of digital objects will be used as the content for 
the first demonstrator of the toolbox. This is a starting point, but the categorisation system will need iterative 
improvement. The pilot study on tagging inter-calibration could serve as a model for a future CWFR cross- 
disciplinary workflow tagging system. 